Regular Expressions in R

Author

Martin Schweinberger

Introduction

This tutorial introduces regular expressions (regex) and demonstrates how to use them when working with language data in R. A regular expression is a special sequence of characters that describes a search pattern. You can think of regular expressions as precision search tools — far more powerful than simple find-and-replace — that let you locate, extract, validate, and transform text based on its structure rather than its exact content.

Regular expressions have wide applications across linguistics and computational humanities: searching corpora for inflected forms, extracting named entities, cleaning OCR output, tokenising text, validating annotation schemes, and building text-processing pipelines. Once mastered, they become one of the most versatile tools in any language researcher’s toolkit.

Prerequisite Tutorials

Before working through this tutorial, please complete or familiarise yourself with:


What This Tutorial Covers
  1. Basic characters — literal matching and the wildcard .
  2. Anchors — matching positions (start, end, word boundaries)
  3. Character classes — sets, ranges, and POSIX classes
  4. Quantifiers — repetition (*, +, ?, {n,m})
  5. Groups and alternation — capturing groups and |
  6. Special escape sequences — \w, \d, \s, and their negations
  7. Lookahead and lookbehind — context-sensitive matching
  8. Key stringr functions — str_detect(), str_extract(), str_replace(), and more
  9. Practical applications — corpus searches, text cleaning, extraction tasks
  10. Regex in dplyr pipelines — filtering and mutating with patterns
External Resources

For further study, the following resources are highly recommended:


Preparation and Session Set-up

Install required packages (once only):

Code
install.packages("stringr")  
install.packages("dplyr")  
install.packages("flextable")  
install.packages("checkdown")  

Load packages:

Code
library(stringr)     # string manipulation and regex functions  
library(dplyr)       # data frame manipulation  
library(flextable)   # formatted tables  
library(checkdown)   # interactive exercises  
  
options(stringsAsFactors = FALSE)  
options(scipen = 100)  

We will work with two types of objects throughout: a short example sentence for demonstrating individual patterns, and a longer example text representing realistic corpus data.

Code
# Short example sentence for basic demonstrations  
sent <- "The cat sat on the mat."  
  
# A longer example text: an excerpt about linguistics  
et <- paste(  
  "Grammar is the system of a language. People sometimes describe grammar as",  
  "the rules of a language, but in fact no language has rules. If we use the",  
  "word rules, we suggest that somebody created the rules first and then spoke",  
  "the language, like the rules of a game. But languages did not start like",  
  "that. Languages started when humans started to communicate with each other.",  
  "Grammars developed naturally. After some time, people described the grammar",  
  "of their languages. Languages change over time. Grammar changes too.",  
  "Children learn the grammar of their first language naturally. They do not",  
  "need to study it. Native speakers know intuitively whether a sentence is",  
  "grammatically correct or not. Non-native speakers often learn grammar rules",  
  "formally, through instruction. Prescriptive grammar describes how people",  
  "should speak, while descriptive grammar describes how people actually speak.",  
  "Linguists study grammars to understand language structure and acquisition.",  
  "The field of syntax deals with sentence structure, while morphology examines",  
  "how words are formed. Phonology studies sound systems in human languages.",  
  "Pragmatics investigates how context influences the interpretation of meaning.",  
  "Computational linguistics applies formal grammar to natural language processing.",  
  "Regular expressions are useful tools for searching and extracting patterns.",  
  "They can match words like 'cat', 'bat', or 'hat' with a single pattern."  
)  
  
# Split into individual tokens (words and punctuation)  
tokens <- str_split(et, "\\s+") |> unlist()  

Regular Expression Patterns

Section Overview

What you’ll learn: The building blocks of regular expressions — how each type of pattern works and what it matches

Key concept: Regular expressions describe structure, not content. [aeiou]{2,} matches any sequence of two or more vowels, regardless of which vowels or in which word.
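As a quick illustration (using str_extract_all(), introduced below):

Code
# [aeiou]{2,} finds any run of two or more vowels, whatever they are  
str_extract_all("queueing is beautiful", "[aeiou]{2,}")  
[[1]]
[1] "ueuei" "eau"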

Basic Characters

The simplest regular expression is a literal character — it matches exactly that character. A sequence of literal characters matches that exact sequence:

Code
# Literal match: does "cat" appear in the sentence?  
str_detect(sent, "cat")  
[1] TRUE
Code
# The dot . matches ANY single character except newline  
str_detect(sent, "c.t")    # matches "cat"  
[1] TRUE
Code
str_detect(sent, "m.t")    # matches "mat"  
[1] TRUE
Code
str_detect(sent, ".at")    # matches "cat", "sat", "mat"  
[1] TRUE

To match a literal dot (rather than “any character”), escape it with a double backslash:

Code
# Match a literal period at the end of the sentence  
str_detect(sent, "\\.")    # TRUE — the sentence ends with a full stop  
[1] TRUE
Code
# Without escaping, . matches any character:  
str_detect("abc", ".")     # TRUE — any character matches  
[1] TRUE
Code
str_detect("abc", "\\.")   # FALSE — no literal dot in "abc"  
[1] FALSE
The Double Backslash in R

In most programming languages, a single backslash \ is the regex escape character. In R strings, \ itself must be escaped, so regex escapes require double backslash \\. For example:

  • \\. in R code → \. as a regex → matches a literal dot
  • \\b in R code → \b as a regex → matches a word boundary
  • \\d in R code → \d as a regex → matches a digit

This double-backslash requirement catches many beginners. Remember: every \ you intend for regex needs to be written as \\ in R.
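You can check what R actually stores by printing the string:

Code
# writeLines() shows the string exactly as the regex engine receives it  
writeLines("\\.")   # the string contains two characters: backslash + dot  
\.
Code
nchar("\\.")        # 2 characters, not 4  
[1] 2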

Anchors

Anchors match positions in the string, not characters. They constrain where in the string a pattern can match.

Code
# ^ matches the START of the string  
str_detect(sent, "^The")     # TRUE — "The" is at the start  
[1] TRUE
Code
str_detect(sent, "^cat")     # FALSE — "cat" is not at the start  
[1] FALSE
Code
# $ matches the END of the string  
str_detect(sent, "mat\\.$")  # TRUE — "mat." is at the end  
[1] TRUE
Code
str_detect(sent, "cat$")     # FALSE — "cat" is not at the end  
[1] FALSE
Code
# \b matches a WORD BOUNDARY (between a word char and a non-word char)  
str_detect("catalogue", "\\bcat\\b")   # FALSE — "cat" is part of a word  
[1] FALSE
Code
str_detect("the cat sat", "\\bcat\\b") # TRUE — "cat" is a whole word  
[1] TRUE
Code
# \B matches where \b does NOT (i.e., inside a word)  
str_detect("catalogue", "\\Bcat\\B")   # FALSE — "cat" is at word START  
[1] FALSE
Code
str_detect("concatenate", "\\Bcat\\B") # TRUE — "cat" is in the middle  
[1] TRUE
Word Boundaries in Corpus Searches

\b is indispensable for corpus searches. Without it, searching for “the” would match “the” inside “other”, “there”, and “rather”. Always use \\bword\\b when you want whole-word matches.
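A quick demonstration of the difference:

Code
# Without boundaries, "the" also matches inside other words  
str_extract_all("the other thesis is there", "the")  
[[1]]
[1] "the" "the" "the" "the"
Code
# With boundaries, only the whole word matches  
str_extract_all("the other thesis is there", "\\bthe\\b")  
[[1]]
[1] "the"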

Character Classes

A character class [...] matches any single character from the set listed inside the brackets:

Code
# Match 'c', 's', or 'm' followed by 'at'  
str_extract_all(sent, "[csm]at")  
[[1]]
[1] "cat" "sat" "mat"
Code
# Negated class [^...]: match any character NOT in the set  
str_extract_all(sent, "[^aeiou ]")   # non-vowel, non-space characters  
[[1]]
 [1] "T" "h" "c" "t" "s" "t" "n" "t" "h" "m" "t" "."
Code
# Ranges  
str_extract_all("Hello World 123", "[a-z]")   # lowercase letters  
[[1]]
[1] "e" "l" "l" "o" "o" "r" "l" "d"
Code
str_extract_all("Hello World 123", "[A-Z]")   # uppercase letters  
[[1]]
[1] "H" "W"
Code
str_extract_all("Hello World 123", "[0-9]")   # digits  
[[1]]
[1] "1" "2" "3"
Code
str_extract_all("Hello World 123", "[a-zA-Z]") # all letters  
[[1]]
 [1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"

POSIX Character Classes

R supports POSIX character classes — named sets written as [:name:] inside an outer [...]:

Code
str_extract_all("Hello, World! 123.", "[[:alpha:]]")  # letters only  
[[1]]
 [1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"
Code
str_extract_all("Hello, World! 123.", "[[:digit:]]")  # digits only  
[[1]]
[1] "1" "2" "3"
Code
str_extract_all("Hello, World! 123.", "[[:punct:]]")  # punctuation only  
[[1]]
[1] "," "!" "."
Code
str_extract_all("Hello, World! 123.", "[[:alnum:]]")  # letters and digits  
[[1]]
 [1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d" "1" "2" "3"
Code
str_extract_all("Hello\tWorld  123",  "[[:blank:]]")  # spaces and tabs  
[[1]]
[1] "\t" " "  " " 

The full set of POSIX classes available in R:

Class       Matches
[:alpha:]   Any letter (a–z, A–Z)
[:lower:]   Lowercase letters (a–z)
[:upper:]   Uppercase letters (A–Z)
[:digit:]   Digits (0–9)
[:alnum:]   Letters and digits
[:punct:]   Punctuation: . , ; : ! ? " ' ( ) [ ] { } / \ @ # $ % ^ & * - _ + = ~ ` |
[:space:]   All whitespace: space, tab, newline, return, form-feed
[:blank:]   Space and tab only
[:graph:]   All visible characters (alnum + punct)
[:print:]   Printable characters (graph + space)

Quantifiers

Quantifiers specify how many times the preceding element should match:

Code
# * : 0 or more  
str_extract_all("aabbbcccc", "b*")   # matches "", "bbb", ""...  
[[1]]
[1] ""    ""    "bbb" ""    ""    ""    ""    ""   
Code
# + : 1 or more  
str_extract_all("aabbbcccc", "b+")   # matches "bbb"  
[[1]]
[1] "bbb"
Code
# ? : 0 or 1 (makes the element optional)  
str_detect(c("color", "colour"), "colou?r")   # both TRUE  
[1] TRUE TRUE
Code
# {n} : exactly n consecutive occurrences  
str_extract_all(et, "[a-z]{10}")     # runs of 10 lowercase letters (also inside longer words)  
[[1]]
 [1] "communicat" "intuitivel" "grammatica" "instructio" "rescriptiv"
 [6] "descriptiv" "understand" "acquisitio" "morphology" "investigat"
[11] "influences" "interpreta" "omputation" "linguistic" "processing"
[16] "expression" "extracting"
Code
# {n,} : n or more  
# str_subset() keeps only matching tokens, avoiding a long list of empty results  
str_subset(tokens, "^[[:alpha:]]{8,}$") |> head(10)   # words of 8+ letters  
 [1] "sometimes"   "describe"    "language"    "somebody"    "languages"  
 [6] "Languages"   "communicate" "Grammars"    "developed"   "described"  
Code
# {n,m} : between n and m  
str_subset(tokens, "^[[:alpha:]]{4,6}$") |> head(10)  # words of 4–6 letters  
 [1] "system" "People" "rules"  "fact"   "word"   "that"   "rules"  "first" 
 [9] "then"   "spoke" 

Greedy vs. Lazy Matching

By default, quantifiers are greedy — they match as much as possible. Adding ? after a quantifier makes it lazy — it matches as little as possible:

Code
html <- "<b>bold</b> and <i>italic</i>"  
  
# Greedy: matches from first < to LAST >  
str_extract(html, "<.+>")  
[1] "<b>bold</b> and <i>italic</i>"
Code
# Lazy: matches from first < to next >  
str_extract(html, "<.+?>")  
[1] "<b>"
Code
# Extract each tag individually (lazy)  
str_extract_all(html, "<.+?>")  
[[1]]
[1] "<b>"  "</b>" "<i>"  "</i>"

Groups and Alternation

Parentheses () create a capturing group — a sub-pattern whose match can be referenced or extracted separately. The alternation operator | means OR within a group or pattern.

Code
# Alternation: match "cat" OR "dog"  
str_detect(c("I have a cat", "I have a dog", "I have a fish"),  
           "cat|dog")  
[1]  TRUE  TRUE FALSE
Code
# Alternation inside a group: match "colour" OR "color"  
str_extract_all(c("British colour", "American color"), "colo(u|)r")  
[[1]]
[1] "colour"

[[2]]
[1] "color"
Code
# Match all forms of "walk": walk, walks, walked, walking, walker  
# (the example text happens to contain no forms of "walk", so the result is empty)  
str_extract_all(et, "walk(s|ed|ing|er)?")  
[[1]]
character(0)
Code
# Groups allow repetition of a sub-pattern  
str_detect("abababab", "(ab)+")   # matches one or more "ab"  
[1] TRUE

Non-Capturing Groups

Use (?:...) when you need to group for alternation or quantification but do not need to capture the match:

Code
# Group for alternation without capturing  
str_extract_all(et, "(?:gram|morpho|phono)logy")  
[[1]]
[1] "morphology"

Special Escape Sequences

R supports shorthand escape sequences for common character classes:

Sequence (in R code)   Matches                              Example (R string)
\\w                    Word characters: [[:alnum:]_]        "\\w+"
\\W                    Non-word characters: [^[:alnum:]_]   "\\W+"
\\d                    Digits: [[:digit:]]                  "\\d+"
\\D                    Non-digits: [^[:digit:]]             "\\D+"
\\s                    Whitespace: [[:space:]]              "\\s+"
\\S                    Non-whitespace: [^[:space:]]         "\\S+"
\\b                    Word boundary (position)             "\\bcat\\b"
\\B                    Non-word boundary (position)         "\\Bcat\\B"

Code
# \w: word characters  
str_extract_all("price: $4.99!", "\\w+")  
[[1]]
[1] "price" "4"     "99"   
Code
# \d: digits  
str_extract_all("Call 07 3365 1234 or 07 3346 5678", "\\d+")  
[[1]]
[1] "07"   "3365" "1234" "07"   "3346" "5678"
Code
# \s: whitespace (useful for splitting on any whitespace)  
str_split("word1   word2\tword3\nword4", "\\s+")[[1]]  
[1] "word1" "word2" "word3" "word4"
Code
# \b: whole-word match  
str_extract_all("grammar, grammarian, ungrammatical", "\\bgrammar\\b")  
[[1]]
[1] "grammar"

Lookahead and Lookbehind

Lookaround assertions match a position based on what comes before or after it, without including that context in the match. They are essential for extracting values that are preceded or followed by specific markers.

Syntax     Name                  Matches
(?=...)    Positive lookahead    Position followed by ...
(?!...)    Negative lookahead    Position NOT followed by ...
(?<=...)   Positive lookbehind   Position preceded by ...
(?<!...)   Negative lookbehind   Position NOT preceded by ...

Code
prices <- c("$12.99", "$4.50", "USD 7.00", "8.95 EUR")  
  
# Positive lookahead: match digits followed by a dot  
str_extract_all(prices, "\\d+(?=\\.)")  
[[1]]
[1] "12"

[[2]]
[1] "4"

[[3]]
[1] "7"

[[4]]
[1] "8"
Code
# Positive lookbehind: match digits preceded by "$"  
str_extract_all(prices, "(?<=\\$)\\d+\\.\\d+")  
[[1]]
[1] "12.99"

[[2]]
[1] "4.50"

[[3]]
character(0)

[[4]]
character(0)
Code
# Negative lookbehind: match numbers NOT preceded by "$"  
str_extract_all(prices, "(?<!\\$)\\b\\d+\\.\\d+")  
[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "7.00"

[[4]]
[1] "8.95"

A linguistic example — extract words that come before a comma:

Code
sample_text <- "Grammar, syntax, and morphology are core subfields of linguistics."  
str_extract_all(sample_text, "\\w+(?=,)")  
[[1]]
[1] "Grammar" "syntax" 

Exercises: Regex Patterns

Q1. What does the regex ^[A-Z] match?






Q2. What is the difference between colou?r and colo[u]?r?






Q3. You want to match words of exactly 5 characters that consist only of lowercase letters. Which pattern is correct?






Key stringr Functions

Section Overview

What you’ll learn: The stringr functions used most frequently with regular expressions, and when to use each

Key functions: str_detect(), str_count(), str_extract(), str_extract_all(), str_replace(), str_replace_all(), str_remove(), str_remove_all(), str_split(), str_locate()

The stringr package provides a consistent, user-friendly interface to regular expressions in R. All stringr functions follow the same pattern: the string comes first, the pattern second.

str_detect() — Does the Pattern Exist?

Returns TRUE/FALSE for each string in a vector. Most commonly used for filtering:

Code
words_sample <- c("grammar", "syntax", "morphology", "phonology",  
                  "pragmatics", "grammarian", "ungrammatical")  
  
# Which words contain "gram"?  
str_detect(words_sample, "gram")  
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
Code
# Which words start with a vowel?  
str_detect(words_sample, "^[aeiou]")  
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Code
# Negate with !  
words_sample[!str_detect(words_sample, "gram")]  
[1] "syntax"     "morphology" "phonology"  "pragmatics"

str_count() — How Many Times?

Counts non-overlapping occurrences of a pattern within each string:

Code
# How many vowels in each word?  
str_count(words_sample, "[aeiou]")  
[1] 2 1 3 3 3 4 5
Code
# How many times does "a" appear in the example text?  
str_count(et, "\\ba\\b")   # the word "a" as a whole word  
[1] 5

str_extract() and str_extract_all() — What Matches?

str_extract() returns the first match in each string. str_extract_all() returns all matches as a list:

Code
# Extract the first sequence of 3+ consonants  
str_extract(words_sample, "[^aeiou]{3,}")  
[1] NA     "synt" "rph"  NA     NA     NA     "ngr" 
Code
# Extract all sequences of digits from a mixed string  
mixed <- c("price: 12.99 dollars", "code: A4-B12", "year: 2024")  
str_extract_all(mixed, "\\d+")  
[[1]]
[1] "12" "99"

[[2]]
[1] "4"  "12"

[[3]]
[1] "2024"
Code
# Extract all words longer than 8 characters from the example text  
long_words <- str_extract_all(et, "\\b[[:alpha:]]{9,}\\b")[[1]]  
sort(unique(long_words))  
 [1] "acquisition"    "communicate"    "Computational"  "described"     
 [5] "describes"      "descriptive"    "developed"      "expressions"   
 [9] "extracting"     "grammatically"  "influences"     "instruction"   
[13] "interpretation" "intuitively"    "investigates"   "languages"     
[17] "Languages"      "linguistics"    "Linguists"      "morphology"    
[21] "naturally"      "Phonology"      "Pragmatics"     "Prescriptive"  
[25] "processing"     "searching"      "sometimes"      "structure"     
[29] "understand"    

str_replace() and str_replace_all()

Replace the first (or all) occurrence(s) of a pattern with a replacement string. Backreferences (\\1, \\2) refer to captured groups in the replacement:

Code
# Replace first match  
str_replace(sent, "[csm]at", "dog")  
[1] "The dog sat on the mat."
Code
# Replace all matches  
str_replace_all(sent, "[csm]at", "dog")  
[1] "The dog dog on the dog."
Code
# Backreference: reverse the order of two words separated by "and"  
str_replace_all("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")  
[1] "dogs and cats"
Code
# Add emphasis around all long words  
str_replace_all(et, "\\b[[:alpha:]]{8,}\\b", "**\\0**") |>  
  substr(1, 120)   # show first 120 characters  
[1] "Grammar is the system of a **language**. People **sometimes** **describe** grammar as the rules of a **language**, but i"

str_remove() and str_remove_all()

Shorthand for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""):

Code
# Remove all punctuation from the sentence  
str_remove_all(sent, "[[:punct:]]")  
[1] "The cat sat on the mat"
Code
# Remove all digits  
str_remove_all("Call us on 07-3365-1234", "\\d")  
[1] "Call us on --"
Code
# Remove leading and trailing whitespace  
str_remove_all("  linguistics  ", "^\\s+|\\s+$")  
[1] "linguistics"
Code
# Keep only purely alphabetic tokens of at least 4 letters  
long_tokens <- tokens[str_detect(tokens, "^[[:alpha:]]{4,}$")]  
head(long_tokens, 10)  
 [1] "Grammar"   "system"    "People"    "sometimes" "describe"  "grammar"  
 [7] "rules"     "fact"      "language"  "word"     

str_split()

Split strings on a pattern, returning a list:

Code
# Split on whitespace  
str_split("the cat sat on the mat", "\\s+")[[1]]  
[1] "the" "cat" "sat" "on"  "the" "mat"
Code
# Split on punctuation or whitespace  
str_split("one,two; three    four", "[[:punct:]\\s]+")[[1]]  
[1] "one"   "two"   "three" "four" 
Code
# Split a text into sentences (approximate)  
sentences <- str_split(et, "(?<=[.!?])\\s+")[[1]]  
head(sentences, 3)  
[1] "Grammar is the system of a language."                                                                                             
[2] "People sometimes describe grammar as the rules of a language, but in fact no language has rules."                                 
[3] "If we use the word rules, we suggest that somebody created the rules first and then spoke the language, like the rules of a game."

str_locate() — Where Is the Match?

Returns the start and end positions of matches — useful when you need to know where in the string a pattern occurs:

Code
# Find where "grammar" first occurs in the example text  
str_locate(et, "grammar")  
     start end
[1,]    64  70
Code
# Find all occurrences  
str_locate_all(et, "\\bgrammar\\b")[[1]]  
     start  end
[1,]    64   70
[2,]   442  448
[3,]   538  544
[4,]   728  734
[5,]   786  792
[6,]   847  853
[7,]  1237 1243

Exercises: stringr Functions

Q1. What is the difference between str_extract() and str_extract_all()?






Q2. You want to capitalise all words longer than 5 characters in a text. Which stringr function would you use?






Practical Applications

Section Overview

What you’ll learn: How to apply regular expressions to realistic corpus linguistics and text processing tasks

Tasks covered: Corpus searching, text cleaning, extraction, frequency analysis, and dplyr integration

Searching a Corpus: Concordance-Style Extraction

A common corpus task is retrieving all contexts in which a pattern appears. We simulate a small multi-document corpus:

Code
set.seed(42)  
  
corpus <- data.frame(  
  doc_id   = paste0("doc", 1:10),  
  register = rep(c("Academic", "News"), each = 5),  
  text     = c(  
    "Grammar is the systematic study of the structure of a language.",  
    "Morphology examines how words are formed from smaller units called morphemes.",  
    "Syntax deals with the arrangement of words to form grammatical sentences.",  
    "Phonology studies the sound systems and phonological rules of languages.",  
    "Pragmatics investigates how context and intention affect meaning in communication.",  
    "Scientists announced a major breakthrough in natural language processing yesterday.",  
    "The new grammar checker software was released to the public on Monday morning.",  
    "Researchers found that bilingual speakers process syntax differently than monolinguals.",  
    "Language acquisition in children follows predictable phonological and syntactic stages.",  
    "The government launched a literacy program to improve grammar skills in schools."  
  ),  
  stringsAsFactors = FALSE  
)  
Code
# Find all documents containing words ending in "-ology"  
corpus |>  
  dplyr::filter(str_detect(text, "\\b\\w+ology\\b")) |>  
  dplyr::select(doc_id, register, text)  
  doc_id register
1   doc2 Academic
2   doc4 Academic
                                                                           text
1 Morphology examines how words are formed from smaller units called morphemes.
2      Phonology studies the sound systems and phonological rules of languages.
Code
# Extract all "-ology" words from each document  
corpus |>  
  dplyr::mutate(  
    ology_words = sapply(text, function(t)  
      paste(str_extract_all(t, "\\b\\w+ology\\b")[[1]], collapse = ", "))  
  ) |>  
  dplyr::filter(ology_words != "") |>  
  dplyr::select(doc_id, ology_words)  
  doc_id ology_words
1   doc2  Morphology
2   doc4   Phonology

Counting Pattern Frequencies

Code
# Count occurrences of "grammar" (case-insensitive) per document  
corpus |>  
  dplyr::mutate(  
    n_grammar = str_count(text, regex("grammar", ignore_case = TRUE))  
  ) |>  
  dplyr::select(doc_id, register, n_grammar) |>  
  dplyr::arrange(dplyr::desc(n_grammar))  
   doc_id register n_grammar
1    doc1 Academic         1
2    doc7     News         1
3   doc10     News         1
4    doc2 Academic         0
5    doc3 Academic         0
6    doc4 Academic         0
7    doc5 Academic         0
8    doc6     News         0
9    doc8     News         0
10   doc9     News         0
Code
# Count different syntactic subfields mentioned  
subfields <- c("syntax", "morphology", "phonology", "pragmatics", "grammar")  
subfield_counts <- sapply(subfields, function(sf)  
  sum(str_count(corpus$text, regex(sf, ignore_case = TRUE))))  
  
data.frame(subfield = subfields, count = subfield_counts) |>  
  dplyr::arrange(dplyr::desc(count)) |>  
  flextable() |>  
  flextable::set_table_properties(width = .4, layout = "autofit") |>  
  flextable::theme_zebra() |>  
  flextable::fontsize(size = 12) |>  
  flextable::fontsize(size = 12, part = "header") |>  
  flextable::align_text_col(align = "center") |>  
  flextable::set_caption(caption = "Frequency of linguistic subfield terms in the corpus.") |>  
  flextable::border_outer()  

subfield     count
grammar          3
syntax           2
morphology       1
phonology        1
pragmatics       1

Text Cleaning

Regular expressions are the primary tool for cleaning raw corpus text:

Code
raw_texts <- c(  
  "   Grammar  is the  system   of a language.   ",  
  "Words like 'cat', 'bat', and 'hat' rhyme!",  
  "Phone: +61-7-3365-1234  |  Email: info@uq.edu.au",  
  "Chapter 4: Syntax (pp. 112--145) — see also §3.2",  
  "The year\t2024\twas notable for advances in NLP."  
)  
  
raw_texts |>  
  # Normalise whitespace (collapse multiple spaces/tabs to one space)  
  str_replace_all("\\s+", " ") |>  
  # Remove leading and trailing whitespace  
  str_trim() |>  
  # Remove content in parentheses  
  str_remove_all("\\(.*?\\)") |>  
  # Remove section references (§3.2 etc.)  
  str_remove_all("§\\d+\\.\\d+") |>  
  # Remove em dashes and extra spaces left behind  
  str_remove_all("—\\s*") |>  
  # Trim again after removals  
  str_trim()  
[1] "Grammar is the system of a language."          
[2] "Words like 'cat', 'bat', and 'hat' rhyme!"     
[3] "Phone: +61-7-3365-1234 | Email: info@uq.edu.au"
[4] "Chapter 4: Syntax  see also"                   
[5] "The year 2024 was notable for advances in NLP."

Extracting Structured Information

A powerful application of regex is extracting structured information from free text:

Code
# Simulate file names with embedded metadata  
file_names <- c(  
  "speaker01_female_academic_2019.txt",  
  "speaker14_male_news_2021.txt",  
  "speaker07_female_fiction_2020.txt",  
  "speaker23_male_academic_2022.txt"  
)  
  
# Extract each metadata component  
data.frame(  
  filename   = file_names,  
  speaker_id = str_extract(file_names, "speaker\\d+"),  
  gender     = str_extract(file_names, "(?<=_)(female|male)(?=_)"),  
  # [a-z]+ (not \\w+) so the match stops at the following underscore  
  register   = str_extract(file_names, "(?<=_(?:female|male)_)[a-z]+"),  
  year       = str_extract(file_names, "\\d{4}")  
)  
                            filename speaker_id gender register year
1 speaker01_female_academic_2019.txt  speaker01 female academic 2019
2       speaker14_male_news_2021.txt  speaker14   male     news 2021
3  speaker07_female_fiction_2020.txt  speaker07 female  fiction 2020
4   speaker23_male_academic_2022.txt  speaker23   male academic 2022

Case-Insensitive Matching

By default, regex in stringr is case-sensitive. Use regex(..., ignore_case = TRUE) to match regardless of case:

Code
# Match "Grammar", "GRAMMAR", "grammar" etc.  
str_detect(c("Grammar", "GRAMMAR", "grammar", "GrAmMaR"),  
           regex("grammar", ignore_case = TRUE))  
[1] TRUE TRUE TRUE TRUE
Code
# Extract all mentions of a term regardless of capitalisation  
str_extract_all(et, regex("\\bgrammar\\w*\\b", ignore_case = TRUE))[[1]]  
 [1] "Grammar"  "grammar"  "Grammars" "grammar"  "Grammar"  "grammar" 
 [7] "grammar"  "grammar"  "grammar"  "grammars" "grammar" 

Regex in dplyr Pipelines

Regular expressions integrate seamlessly with dplyr for filtering and creating new columns:

Code
corpus |>  
  # Filter: keep only documents mentioning a specific pattern  
  dplyr::filter(str_detect(text, regex("syntax|morphology", ignore_case = TRUE))) |>  
  # Mutate: extract the first linguistic subfield mentioned  
  dplyr::mutate(  
    primary_topic = str_extract(text,  
      regex("syntax|morphology|phonology|pragmatics|grammar",  
            ignore_case = TRUE)),  
    n_words = str_count(text, "\\S+"),  
    has_definition = str_detect(text, "\\bis\\b|\\bdeals with\\b|\\bexamines\\b")  
  ) |>  
  dplyr::select(doc_id, register, primary_topic, n_words, has_definition)  
  doc_id register primary_topic n_words has_definition
1   doc2 Academic    Morphology      11           TRUE
2   doc3 Academic        Syntax      11           TRUE
3   doc8     News        syntax      10          FALSE

Exercises: Practical Applications

Q1. What regular expression would you use to extract all words that contain at least one digit (e.g., “A4”, “mp3”, “COVID-19”)?






Q2. You want to extract the domain name from email addresses (the part after @ and before the first .). Which regex extracts uq from user@uq.edu.au?






Q3. What does str_replace_all(text, "(\\w+) and (\\w+)", "\\2 and \\1") do?






Corpus Search Exercises

Section Overview

Ten practical exercises covering the most common corpus-search regex tasks

Each question asks you to identify the correct regular expression for a realistic search task on a tokenised text vector. All answers use stringr::str_detect() applied to a character vector called text.
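
To try out candidate answers yourself, you can set up a small toy token vector first (the tokens below are invented for illustration) and index it with str_detect():

```r
library(stringr)

# Hypothetical tokens covering the kinds of cases in the questions below
text <- c("walk", "walked", "running", "2024", "well-being",
          "London", "mp3", "you?", "book", "cat@uq.edu.au")

# Example: keep only tokens containing a double "o"
text[str_detect(text, "oo")]
# [1] "book"
```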


Q1. Which regex extracts all forms of walk from a tokenised text (walk, walks, walked, walking, walker)?






Q2. Which regex extracts all words beginning with “un” (e.g., ungrammatical, unusual, undo)?






Q3. Which regex finds all numeric tokens (whole numbers like 2024, 42, 100)?






Q4. Which regex extracts all words ending in -ing (e.g., running, working, thinking)?






Q5. Which regex matches email addresses (e.g., cat@uq.edu.au, info@ladal.edu.au)?






Q6. Which regex identifies tokens that contain at least one digit mixed with letters (e.g., mp3, A4, COVID-19, type2)?






Q7. Which regex extracts hyphenated compound words (e.g., well-being, self-aware, long-term)?






Q8. Which regex finds capitalised tokens — words beginning with an uppercase letter followed by lowercase letters (e.g., proper nouns like London, Paris, Grammar)?






Q9. Which regex finds tokens that are questions ending with a question mark (e.g., you?, this?)?






Q10. Which regex finds tokens containing double vowels (e.g., agreement, book, see, moon)?






Quick Reference

Section Overview

A compact reference for the most commonly used regex elements in R

Pattern Summary Table

Pattern        Meaning
.              Any character except newline
^              Start of string / line
$              End of string / line
\\b            Word boundary
\\B            Non-word boundary
[abc]          One of: a, b, or c
[^abc]         Not a, b, or c
[a-z]          Lowercase letter
[[:alpha:]]    Any letter
[[:digit:]]    Any digit
[[:punct:]]    Any punctuation
*              0 or more (greedy)
+              1 or more (greedy)
?              0 or 1 (optional)
{n}            Exactly n times
{n,}           n or more times
{n,m}          Between n and m times
(abc)          Capturing group
(?:abc)        Non-capturing group
a|b            a or b
\\w            Word character [a-zA-Z0-9_]
\\d            Digit [0-9]
\\s            Whitespace
\\W            Non-word character
\\D            Non-digit
\\S            Non-whitespace
(?=...)        Positive lookahead
(?!...)        Negative lookahead
(?<=...)       Positive lookbehind
(?<!...)       Negative lookbehind
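
A few of the patterns above in action (a quick sketch; the input strings are invented for illustration):

```r
library(stringr)

str_detect("cat", "^c.t$")                              # TRUE: anchors around a wildcard
str_extract("page 42", "\\d+")                          # "42": one or more digits
str_extract("colour", "colou?r")                        # "colour": optional "u"
str_extract_all("see the moon", "[aeiou]{2}")[[1]]      # "ee" "oo": double vowels
str_replace("Smith, John", "(\\w+), (\\w+)", "\\2 \\1") # "John Smith": backreferences
```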

stringr Function Summary

Function                   Returns
str_detect(x, p)           logical vector — does p match?
str_count(x, p)            integer vector — how many matches?
str_extract(x, p)          character vector — first match (NA if none)
str_extract_all(x, p)      list of character vectors — all matches
str_replace(x, p, r)       character vector — first match replaced
str_replace_all(x, p, r)   character vector — all matches replaced
str_remove(x, p)           character vector — first match removed
str_remove_all(x, p)       character vector — all matches removed
str_split(x, p)            list of character vectors — parts between matches
str_locate(x, p)           integer matrix — start and end of first match
str_locate_all(x, p)       list of integer matrices — all match positions
str_starts(x, p)           logical vector — does x start with p?
str_ends(x, p)             logical vector — does x end with p?
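
The functions above can be contrasted on a single sentence (a minimal sketch; the input string is invented for illustration):

```r
library(stringr)

x <- "The cat sat on the mat"

str_detect(x, "at")               # TRUE
str_count(x, "at")                # 3 ("cat", "sat", "mat")
str_extract(x, "\\w+at")          # "cat" -- first match only
str_extract_all(x, "\\w+at")[[1]] # "cat" "sat" "mat"
str_replace(x, "at", "og")        # "The cog sat on the mat" -- first match only
str_replace_all(x, "at", "og")    # "The cog sog on the mog"
str_split(x, "\\s+")[[1]]         # the six individual words
```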


Citation & Session Info

Schweinberger, Martin. 2026. Regular Expressions in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2026.02.19).

@manual{schweinberger2026regex,  
  author       = {Schweinberger, Martin},  
  title        = {Regular Expressions in R},  
  note         = {https://ladal.edu.au/tutorials/regex/regex.html},  
  year         = {2026},  
  organization = {The University of Queensland, Australia. School of Languages and Cultures},  
  address      = {Brisbane},  
  edition      = {2026.02.19}  
}  
Code
sessionInfo()  
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] checkdown_0.0.13 flextable_0.9.7  lubridate_1.9.4  forcats_1.0.0   
 [5] stringr_1.5.1    dplyr_1.2.0      purrr_1.0.4      readr_2.1.5     
 [9] tidyr_1.3.2      tibble_3.2.1     ggplot2_4.0.2    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] generics_0.1.3          fontLiberation_0.1.0    renv_1.1.1             
 [4] xml2_1.3.6              stringi_1.8.4           hms_1.1.3              
 [7] digest_0.6.39           magrittr_2.0.3          evaluate_1.0.3         
[10] grid_4.4.2              timechange_0.3.0        RColorBrewer_1.1-3     
[13] fastmap_1.2.0           jsonlite_1.9.0          zip_2.3.2              
[16] scales_1.4.0            fontBitstreamVera_0.1.1 codetools_0.2-20       
[19] textshaping_1.0.0       cli_3.6.4               rlang_1.1.7            
[22] fontquiver_0.2.1        litedown_0.9            commonmark_2.0.0       
[25] withr_3.0.2             yaml_2.3.10             gdtools_0.4.1          
[28] tools_4.4.2             officer_0.6.7           uuid_1.2-1             
[31] tzdb_0.4.0              vctrs_0.7.1             R6_2.6.1               
[34] lifecycle_1.0.5         htmlwidgets_1.6.4       ragg_1.3.3             
[37] pkgconfig_2.0.3         pillar_1.10.1           gtable_0.3.6           
[40] glue_1.8.0              data.table_1.17.0       Rcpp_1.0.14            
[43] systemfonts_1.2.1       xfun_0.56               tidyselect_1.2.1       
[46] rstudioapi_0.17.1       knitr_1.51              farver_2.1.2           
[49] htmltools_0.5.9         rmarkdown_2.30          compiler_4.4.2         
[52] S7_0.2.1                markdown_2.0            askpass_1.2.1          
[55] openssl_2.3.2          



References

Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Sebastopol, CA: O’Reilly Media.
Peng, Roger D. 2020. R Programming for Data Science. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.